2,052 research outputs found

    Crowdsourcing for Speech: Economic, Legal and Ethical analysis

    With respect to spoken language resource production, crowdsourcing - the process of distributing tasks to an open, unspecified population via the internet - offers a wide range of opportunities: populations with specific skills are potentially instantaneously accessible somewhere on the globe for any spoken language. As is the case for most newly introduced high-tech services, crowdsourcing raises both hopes and doubts, certainties and questions. A general analysis of crowdsourcing for speech processing can be found in (Eskenazi et al., 2013). This article focuses on ethical, legal and economic issues of crowdsourcing in general (Zittrain, 2008a) and of crowdsourcing services such as Amazon Mechanical Turk (Fort et al., 2011; Adda et al., 2011), a major platform for multilingual language resources (LR) production.

    The CAMOMILE collaborative annotation platform for multi-modal, multi-lingual and multi-media documents

    In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analysis which can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated to a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored specifically to the needed task can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry run experiment, the manual annotation of 716 speech segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed in open source. Peer reviewed. Postprint (author's final draft).

    Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use

    International audience. This article is a position paper about Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, this type of online work platform makes it possible to develop all sorts of quality language resources quickly, at a very low price, using people who do this as a hobby. We shall demonstrate here that the situation is far from ideal. Our goal here is manifold: 1- to inform researchers, so that they can make their own choices, 2- to develop alternatives with the help of funding agencies and scientific associations, 3- to propose practical and organizational solutions in order to improve language resources development, while limiting the risks of ethical and legal issues without sacrificing price or quality, 4- to introduce an Ethics and Big Data Charter for the documentation of language resourc

    Un turc mécanique pour les ressources linguistiques : critique de la myriadisation du travail parcellisé

    International audience. This article is a position paper concerning Amazon Mechanical Turk-like systems, the use of which has been steadily growing in natural language processing in the past few years. According to the mainstream opinion expressed in the articles of the domain, these online work platforms make it possible to develop all sorts of quality language resources very quickly, for a very low price, using people who do this as a hobby. We shall demonstrate here that the situation is far from ideal, be it from the point of view of quality, price, workers' status or ethics. We shall then recall existing or proposed alternatives. Our goal here is twofold: to inform researchers, so that they can make their own choices with full knowledge of the facts, and to propose practical and organizational solutions to improve the development of new language resources, while limiting the risks of ethical and legal issues without sacrificing price or quality.

    "Where the data are coming from?" Ethics, crowdsourcing and traceability for Big Data in Human Language Technology

    National audience. Based on experience gained from observing corpus development in HLT, the authors want to warn the Big Data community about some recent uses of human computation. For instance, the growing use in the HLT community of crowdsourcing methods, and especially of paid microworking crowdsourcing platforms, leads to many ethical, economic and legal concerns. The authors also want to encourage certain practices, especially concerning traceability, implemented in the form of a charter, the Ethics and Big Data Charter.

    The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing

    This paper introduces the NLP4NLP corpus, which contains articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), comprising 65,000 documents, gathering 50,000 authors, including 325,000 references and representing ~270 million words. Most of these publications are in English, some are in French, German, or Russian. Some are open access, others have been provided by the publishers. In order to constitute and analyze this corpus several tools have been used or developed. Many of them use Natural Language Processing methods that have been published in the corpus, hence its name. The paper presents the corpus and some findings regarding its content (evolution over time of the number of articles and authors, collaborations between authors, citations between papers and authors), in the context of a global or comparative analysis between sources. Numerous manual corrections were necessary, which demonstrated the importance of establishing standards for uniquely identifying authors, articles, or publications

    Dwarfs Sitting on Giants’ Shoulders: How LTs for Regional and Minority Languages Can Benefit from Piggybacking on Major Languages

    LTs (language technologies) are necessary instruments for all languages, especially for those aiming at conquering a space in digital devices. Languages that are not seriously equipped with LTs face digital extinction in the long run. Many challenges are to be faced to equip minority languages with LTs (from basic to advanced): an almost complete lack of knowledge about available resources and technologies; substantial delays in development of basic technologies; lack of cooperation among minority language communities; a chronic shortage of funding (in particular for minority languages not officially recognized, which are often the most vital ones on the Internet); and the limited economic value allotted to LTs for minority languages by digital market rules. In this paper we show how these challenges can be overcome, and how coordinated and standardized cooperation among all interested stakeholders can lead to better knowledge and awareness of the breadth and depth of available technologies.

    Chromatin Profiles of Chromosomally Integrated Human Herpesvirus-6A

    Human herpesvirus-6A (HHV-6A) and 6B (HHV-6B) are two closely related betaherpesviruses that are associated with various diseases including seizures and encephalitis. The HHV-6A/B genomes have been shown to be present in an integrated state in the telomeres of latently infected cells. In addition, integration of HHV-6A/B in germ cells has resulted in individuals harboring this inherited chromosomally integrated HHV-6A/B (iciHHV-6) in every cell of their body. Until now, the viral transcriptome and the epigenetic modifications that contribute to the silencing of the integrated virus genome remain elusive. In the current study, we used a patient-derived iciHHV-6A cell line to assess the global viral gene expression profile by RNA-seq, and the chromatin profiles by MNase-seq and ChIP-seq analyses. In addition, we investigated an in vitro generated cell line (293-HHV-6A) that expresses GFP upon the addition of agents commonly used to induce herpesvirus reactivation such as TPA. No viral gene expression including miRNAs was detected from the HHV-6A genomes, indicating that the integrated virus is transcriptionally silent. Intriguingly, upon stimulation of the 293-HHV-6A cell line with TPA, only foreign promoters in the virus genome were activated, while all HHV-6A promoters remained completely silenced. The transcriptional silencing of latent HHV-6A was further supported by MNase-seq results, which demonstrate that the latent viral genome resides in a highly condensed nucleosome-associated state. We further explored the enrichment profiles of histone modifications via ChIP-seq analysis. Our results indicated that the HHV-6 genome is modestly enriched with the repressive histone marks H3K9me3/H3K27me3 and does not possess the active histone modifications H3K27ac/H3K4me3. Overall, these results indicate that HHV-6 genomes reside in a condensed chromatin state, providing insight into the epigenetic mechanisms associated with the silencing of the integrated HHV-6A genome

    The strategic impact of META-NET on the regional, national and international level

    This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field. Peer reviewed. Postprint (author's final draft).

    Image Test Libraries for the on-line self-test of functional units in GPUs running CNNs

    The widespread use of artificial intelligence (AI)-based systems has raised several concerns about their deployment in safety-critical systems. Industry standards, such as ISO26262 for automotive, require detecting hardware faults during the mission of the device. Similarly, new standards are being released concerning the functional safety of AI systems (e.g., ISO/IEC CD TR 5469). Hardware solutions have been proposed for the in-field testing of the hardware executing AI applications; however, when used in applications such as Convolutional Neural Networks (CNNs) in image processing tasks, they may increase the hardware cost and affect application performance. This paper proposes, for the very first time, a methodology to develop high-quality test images to be interleaved with the normal inference process of the CNN application. An Image Test Library (ITL) is developed targeting the on-line test of GPU functional units. The proposed approach does not require changing the actual CNN (which would incur costly memory loading operations), since it is able to exploit the actual CNN structure. Experimental results show that a 6-image ITL is able to achieve about 95% stuck-at test coverage on the floating-point multipliers in a GPU. The obtained ITL requires a very low test application time, as well as very little memory space for storing the test images and the golden test responses.